Udacity’s Data Analysis Nano Degree P4

Exploratory Data Analysis

By Ankita Mehta

Dataset

Prosper Loan data provided by Udacity

Exploring the dataset

We begin by looking at the dimensions, column names, and structure of the dataset:

## [1] "No of data points: 113937"
## [1] "No of features: 81"
##  [1] "ListingKey"                         
##  [2] "ListingNumber"                      
##  [3] "ListingCreationDate"                
##  [4] "CreditGrade"                        
##  [5] "Term"                               
##  [6] "LoanStatus"                         
##  [7] "ClosedDate"                         
##  [8] "BorrowerAPR"                        
##  [9] "BorrowerRate"                       
## [10] "LenderYield"                        
## [11] "EstimatedEffectiveYield"            
## [12] "EstimatedLoss"                      
## [13] "EstimatedReturn"                    
## [14] "ProsperRating..numeric."            
## [15] "ProsperRating..Alpha."              
## [16] "ProsperScore"                       
## [17] "ListingCategory..numeric."          
## [18] "BorrowerState"                      
## [19] "Occupation"                         
## [20] "EmploymentStatus"                   
## [21] "EmploymentStatusDuration"           
## [22] "IsBorrowerHomeowner"                
## [23] "CurrentlyInGroup"                   
## [24] "GroupKey"                           
## [25] "DateCreditPulled"                   
## [26] "CreditScoreRangeLower"              
## [27] "CreditScoreRangeUpper"              
## [28] "FirstRecordedCreditLine"            
## [29] "CurrentCreditLines"                 
## [30] "OpenCreditLines"                    
## [31] "TotalCreditLinespast7years"         
## [32] "OpenRevolvingAccounts"              
## [33] "OpenRevolvingMonthlyPayment"        
## [34] "InquiriesLast6Months"               
## [35] "TotalInquiries"                     
## [36] "CurrentDelinquencies"               
## [37] "AmountDelinquent"                   
## [38] "DelinquenciesLast7Years"            
## [39] "PublicRecordsLast10Years"           
## [40] "PublicRecordsLast12Months"          
## [41] "RevolvingCreditBalance"             
## [42] "BankcardUtilization"                
## [43] "AvailableBankcardCredit"            
## [44] "TotalTrades"                        
## [45] "TradesNeverDelinquent..percentage." 
## [46] "TradesOpenedLast6Months"            
## [47] "DebtToIncomeRatio"                  
## [48] "IncomeRange"                        
## [49] "IncomeVerifiable"                   
## [50] "StatedMonthlyIncome"                
## [51] "LoanKey"                            
## [52] "TotalProsperLoans"                  
## [53] "TotalProsperPaymentsBilled"         
## [54] "OnTimeProsperPayments"              
## [55] "ProsperPaymentsLessThanOneMonthLate"
## [56] "ProsperPaymentsOneMonthPlusLate"    
## [57] "ProsperPrincipalBorrowed"           
## [58] "ProsperPrincipalOutstanding"        
## [59] "ScorexChangeAtTimeOfListing"        
## [60] "LoanCurrentDaysDelinquent"          
## [61] "LoanFirstDefaultedCycleNumber"      
## [62] "LoanMonthsSinceOrigination"         
## [63] "LoanNumber"                         
## [64] "LoanOriginalAmount"                 
## [65] "LoanOriginationDate"                
## [66] "LoanOriginationQuarter"             
## [67] "MemberKey"                          
## [68] "MonthlyLoanPayment"                 
## [69] "LP_CustomerPayments"                
## [70] "LP_CustomerPrincipalPayments"       
## [71] "LP_InterestandFees"                 
## [72] "LP_ServiceFees"                     
## [73] "LP_CollectionFees"                  
## [74] "LP_GrossPrincipalLoss"              
## [75] "LP_NetPrincipalLoss"                
## [76] "LP_NonPrincipalRecoverypayments"    
## [77] "PercentFunded"                      
## [78] "Recommendations"                    
## [79] "InvestmentFromFriendsCount"         
## [80] "InvestmentFromFriendsAmount"        
## [81] "Investors"
## 'data.frame':    113937 obs. of  81 variables:
##  $ ListingKey                         : Factor w/ 113066 levels "00003546482094282EF90E5",..: 7180 7193 6647 6669 6686 6689 6699 6706 6687 6687 ...
##  $ ListingNumber                      : int  193129 1209647 81716 658116 909464 1074836 750899 768193 1023355 1023355 ...
##  $ ListingCreationDate                : Factor w/ 113064 levels "2005-11-09 20:44:28.847000000",..: 14184 111894 6429 64760 85967 100310 72556 74019 97834 97834 ...
##  $ CreditGrade                        : Factor w/ 9 levels "","A","AA","B",..: 5 1 8 1 1 1 1 1 1 1 ...
##  $ Term                               : int  36 36 36 36 36 60 36 36 36 36 ...
##  $ LoanStatus                         : Factor w/ 12 levels "Cancelled","Chargedoff",..: 3 4 3 4 4 4 4 4 4 4 ...
##  $ ClosedDate                         : Factor w/ 2803 levels "","2005-11-25 00:00:00",..: 1138 1 1263 1 1 1 1 1 1 1 ...
##  $ BorrowerAPR                        : num  0.165 0.12 0.283 0.125 0.246 ...
##  $ BorrowerRate                       : num  0.158 0.092 0.275 0.0974 0.2085 ...
##  $ LenderYield                        : num  0.138 0.082 0.24 0.0874 0.1985 ...
##  $ EstimatedEffectiveYield            : num  NA 0.0796 NA 0.0849 0.1832 ...
##  $ EstimatedLoss                      : num  NA 0.0249 NA 0.0249 0.0925 ...
##  $ EstimatedReturn                    : num  NA 0.0547 NA 0.06 0.0907 ...
##  $ ProsperRating..numeric.            : int  NA 6 NA 6 3 5 2 4 7 7 ...
##  $ ProsperRating..Alpha.              : Factor w/ 8 levels "","A","AA","B",..: 1 2 1 2 6 4 7 5 3 3 ...
##  $ ProsperScore                       : num  NA 7 NA 9 4 10 2 4 9 11 ...
##  $ ListingCategory..numeric.          : int  0 2 0 16 2 1 1 2 7 7 ...
##  $ BorrowerState                      : Factor w/ 52 levels "","AK","AL","AR",..: 7 7 12 12 25 34 18 6 16 16 ...
##  $ Occupation                         : Factor w/ 68 levels "","Accountant/CPA",..: 37 43 37 52 21 43 50 29 24 24 ...
##  $ EmploymentStatus                   : Factor w/ 9 levels "","Employed",..: 9 2 4 2 2 2 2 2 2 2 ...
##  $ EmploymentStatusDuration           : int  2 44 NA 113 44 82 172 103 269 269 ...
##  $ IsBorrowerHomeowner                : Factor w/ 2 levels "False","True": 2 1 1 2 2 2 1 1 2 2 ...
##  $ CurrentlyInGroup                   : Factor w/ 2 levels "False","True": 2 1 2 1 1 1 1 1 1 1 ...
##  $ GroupKey                           : Factor w/ 707 levels "","00343376901312423168731",..: 1 1 335 1 1 1 1 1 1 1 ...
##  $ DateCreditPulled                   : Factor w/ 112992 levels "2005-11-09 00:30:04.487000000",..: 14347 111883 6446 64724 85857 100382 72500 73937 97888 97888 ...
##  $ CreditScoreRangeLower              : int  640 680 480 800 680 740 680 700 820 820 ...
##  $ CreditScoreRangeUpper              : int  659 699 499 819 699 759 699 719 839 839 ...
##  $ FirstRecordedCreditLine            : Factor w/ 11586 levels "","1947-08-24 00:00:00",..: 8639 6617 8927 2247 9498 497 8265 7685 5543 5543 ...
##  $ CurrentCreditLines                 : int  5 14 NA 5 19 21 10 6 17 17 ...
##  $ OpenCreditLines                    : int  4 14 NA 5 19 17 7 6 16 16 ...
##  $ TotalCreditLinespast7years         : int  12 29 3 29 49 49 20 10 32 32 ...
##  $ OpenRevolvingAccounts              : int  1 13 0 7 6 13 6 5 12 12 ...
##  $ OpenRevolvingMonthlyPayment        : num  24 389 0 115 220 1410 214 101 219 219 ...
##  $ InquiriesLast6Months               : int  3 3 0 0 1 0 0 3 1 1 ...
##  $ TotalInquiries                     : num  3 5 1 1 9 2 0 16 6 6 ...
##  $ CurrentDelinquencies               : int  2 0 1 4 0 0 0 0 0 0 ...
##  $ AmountDelinquent                   : num  472 0 NA 10056 0 ...
##  $ DelinquenciesLast7Years            : int  4 0 0 14 0 0 0 0 0 0 ...
##  $ PublicRecordsLast10Years           : int  0 1 0 0 0 0 0 1 0 0 ...
##  $ PublicRecordsLast12Months          : int  0 0 NA 0 0 0 0 0 0 0 ...
##  $ RevolvingCreditBalance             : num  0 3989 NA 1444 6193 ...
##  $ BankcardUtilization                : num  0 0.21 NA 0.04 0.81 0.39 0.72 0.13 0.11 0.11 ...
##  $ AvailableBankcardCredit            : num  1500 10266 NA 30754 695 ...
##  $ TotalTrades                        : num  11 29 NA 26 39 47 16 10 29 29 ...
##  $ TradesNeverDelinquent..percentage. : num  0.81 1 NA 0.76 0.95 1 0.68 0.8 1 1 ...
##  $ TradesOpenedLast6Months            : num  0 2 NA 0 2 0 0 0 1 1 ...
##  $ DebtToIncomeRatio                  : num  0.17 0.18 0.06 0.15 0.26 0.36 0.27 0.24 0.25 0.25 ...
##  $ IncomeRange                        : Factor w/ 8 levels "$0","$1-24,999",..: 4 5 7 4 3 3 4 4 4 4 ...
##  $ IncomeVerifiable                   : Factor w/ 2 levels "False","True": 2 2 2 2 2 2 2 2 2 2 ...
##  $ StatedMonthlyIncome                : num  3083 6125 2083 2875 9583 ...
##  $ LoanKey                            : Factor w/ 113066 levels "00003683605746079487FF7",..: 100337 69837 46303 70776 71387 86505 91250 5425 908 908 ...
##  $ TotalProsperLoans                  : int  NA NA NA NA 1 NA NA NA NA NA ...
##  $ TotalProsperPaymentsBilled         : int  NA NA NA NA 11 NA NA NA NA NA ...
##  $ OnTimeProsperPayments              : int  NA NA NA NA 11 NA NA NA NA NA ...
##  $ ProsperPaymentsLessThanOneMonthLate: int  NA NA NA NA 0 NA NA NA NA NA ...
##  $ ProsperPaymentsOneMonthPlusLate    : int  NA NA NA NA 0 NA NA NA NA NA ...
##  $ ProsperPrincipalBorrowed           : num  NA NA NA NA 11000 NA NA NA NA NA ...
##  $ ProsperPrincipalOutstanding        : num  NA NA NA NA 9948 ...
##  $ ScorexChangeAtTimeOfListing        : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ LoanCurrentDaysDelinquent          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ LoanFirstDefaultedCycleNumber      : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ LoanMonthsSinceOrigination         : int  78 0 86 16 6 3 11 10 3 3 ...
##  $ LoanNumber                         : int  19141 134815 6466 77296 102670 123257 88353 90051 121268 121268 ...
##  $ LoanOriginalAmount                 : int  9425 10000 3001 10000 15000 15000 3000 10000 10000 10000 ...
##  $ LoanOriginationDate                : Factor w/ 1873 levels "2005-11-15 00:00:00",..: 426 1866 260 1535 1757 1821 1649 1666 1813 1813 ...
##  $ LoanOriginationQuarter             : Factor w/ 33 levels "Q1 2006","Q1 2007",..: 18 8 2 32 24 33 16 16 33 33 ...
##  $ MemberKey                          : Factor w/ 90831 levels "00003397697413387CAF966",..: 11071 10302 33781 54939 19465 48037 60448 40951 26129 26129 ...
##  $ MonthlyLoanPayment                 : num  330 319 123 321 564 ...
##  $ LP_CustomerPayments                : num  11396 0 4187 5143 2820 ...
##  $ LP_CustomerPrincipalPayments       : num  9425 0 3001 4091 1563 ...
##  $ LP_InterestandFees                 : num  1971 0 1186 1052 1257 ...
##  $ LP_ServiceFees                     : num  -133.2 0 -24.2 -108 -60.3 ...
##  $ LP_CollectionFees                  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ LP_GrossPrincipalLoss              : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ LP_NetPrincipalLoss                : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ LP_NonPrincipalRecoverypayments    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ PercentFunded                      : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ Recommendations                    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ InvestmentFromFriendsCount         : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ InvestmentFromFriendsAmount        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Investors                          : int  258 1 41 158 20 1 1 1 1 1 ...

Some variables need to be converted to factors according to their levels.
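The conversion could look roughly like the sketch below. It is not the original project code: the data frame name `loans` is hypothetical, the ten values are the first ten rows as printed by `str()` above, and the best-to-worst level order is an assumption taken from the order of the rating table in the next section.

```r
# Toy stand-in for the Prosper data frame (hypothetical name `loans`),
# filled with the first ten values printed by str() above.
loans <- data.frame(
  ProsperRating..Alpha. = c("", "A", "", "A", "D", "B", "E", "C", "AA", "AA"),
  Term = c(36, 36, 36, 36, 36, 60, 36, 36, 36, 36),
  stringsAsFactors = FALSE
)

# Treat the empty string as missing, then order the ratings from best to worst
# (assumed order: AA, A, B, C, D, E, HR).
rating <- loans$ProsperRating..Alpha.
rating[rating == ""] <- NA
loans$ProsperRating..Alpha. <- factor(
  rating,
  levels = c("AA", "A", "B", "C", "D", "E", "HR"),
  ordered = TRUE
)

# Term takes only a few distinct values, so it is easier to plot as a factor.
loans$Term <- factor(loans$Term)

table(loans$ProsperRating..Alpha., useNA = "ifany")
```

With an ordered factor, plots and tables list the ratings in a meaningful order instead of alphabetically.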

Univariate Plots Section

As the summary below shows, loans without a Prosper rating form the largest single group: the rated counts sum to 84,853 of 113,937 loans, leaving 29,084 unrated. Among the rated loans, ‘C’ is the most common rating.

## 
##    AA     A     B     C     D     E    HR    NA 
##  5372 14551 15581 18345 14274  9795  6935     0

## 
##     1     2     3     4     5     6     7     8     9    10    11 
##   992  5766  7642 12595  9813 12278 10597 12053  6911  4750  1456

Here we see a second rating measure, the Prosper score. Again, loans without a score (NA, not rated) form the largest single group. Among the scored loans, most are concentrated between 4 and 8.

## 
##              Cancelled             Chargedoff              Completed 
##                      5                  11992                  38074 
##                Current              Defaulted FinalPaymentInProgress 
##                  56576                   5018                    205 
##   Past Due (>120 days)   Past Due (1-15 days)  Past Due (16-30 days) 
##                     16                    806                    265 
##  Past Due (31-60 days)  Past Due (61-90 days) Past Due (91-120 days) 
##                    363                    313                    304

This is the distribution of loan status. As expected, most loans are in Current status, and the next largest group is Completed.

## 
##                    Employed     Full-time Not available  Not employed 
##          2255         67322         26355          5347           835 
##         Other     Part-time       Retired Self-employed 
##          3806          1088           795          6134

Most borrowers are employed or work full-time.

##   Not employed             $0      $1-24,999 $25,000-49,999 $50,000-74,999 
##            806            621           7274          32192          31050 
## $75,000-99,999      $100,000+  Not displayed 
##          16916          17337           7741

Notice the borrower income ranges. Middle-income borrowers take the most loans, with the highest concentration in the $25,000-$74,999 range. These could be customers who are targeted for home loans, car loans, etc.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0    3200    4667    5608    6825 1750000

Notice that the median stated monthly income is $4,667 and the mean is $5,608, which seems fairly typical for a professional with a few years of experience.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   0.140   0.220   0.276   0.320  10.010    8554

The debt-to-income ratio has a long right tail, which is as expected. The majority of people in the U.S. are expected to have a credit history, and the ratio should be low enough for repayment to be secure. With a median of 0.22, roughly 25% seems to be the threshold for most borrowers.

Let us look at how the loans are distributed over time.

## 
##  2005  2006  2007  2008  2009  2010  2011  2012  2013  2014 
##    22  5906 11460 11552  2047  5652 11228 19553 34345 12172

The data covers the years 2005-2014. Most of the loans were originated in 2013. Notice that very few loans were originated in 2009. This could be due to the market downturn that followed the 2008 global financial crisis.
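A yearly count like the one above can be derived from the `LoanOriginationDate` strings. The following is a minimal sketch rather than the original code: only the first date string appears in the `str()` output, and the other three are made up for illustration.

```r
# Sample LoanOriginationDate strings in the format shown by str().
# Only the first value comes from the output above; the rest are hypothetical.
origination <- c("2005-11-15 00:00:00", "2009-07-20 00:00:00",
                 "2013-03-01 00:00:00", "2013-10-09 00:00:00")

# Parse the date part; the trailing time characters are ignored by the format.
dates <- as.Date(origination, format = "%Y-%m-%d")

# Extract the four-digit year and count loans per year.
loan_year <- format(dates, "%Y")
year_counts <- table(loan_year)
year_counts
```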

## Debt Consolidation   Home Improvement           Business 
##              58308               7433               7189 
##      Personal Loan        Student Use               Auto 
##               2395                756               2572 
##    Baby & Adoption               Boat Cosmetic Procedure 
##                199                 85                 91 
##    Engagement Ring        Green Loans Household Expenses 
##                217                 59               1996 
##    Large Purchases     Medical/Dental         Motorcycle 
##                876               1522                304 
##                 RV              Taxes           Vacation 
##                 52                885                768 
##      Wedding Loans              Other      Not Available 
##                771              10494              16965

We notice that a sizeable number of listings do not state the purpose of the loan. There is a surprisingly large demand for debt consolidation. My intuition is that young people entering the real world start to repay their student debt, purchase cars, take out mortgages on their apartments, etc. This is probably less true of people who are already settled.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1000    4000    6500    8337   12000   35000

As we can see, the minimum loan amount is $1,000 and the maximum is $35,000. Interestingly, there are peaks at $5,000, $10,000, $15,000, and $20,000. This suggests that people tend to borrow round amounts in multiples of $5,000.
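One way to quantify the round-amount pattern is to measure the share of loans whose amount is an exact multiple of $5,000. The sketch below uses only the first ten `LoanOriginalAmount` values printed by `str()` above, so the resulting share describes that tiny sample, not the full dataset.

```r
# First ten LoanOriginalAmount values as printed in the str() output above.
amount <- c(9425, 10000, 3001, 10000, 15000, 15000, 3000, 10000, 10000, 10000)

# Flag loans whose amount is an exact multiple of $5,000.
is_round <- amount %% 5000 == 0

# Share of round amounts in this small sample (7 of 10).
round_share <- mean(is_round)
round_share
```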

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.1340  0.1840  0.1928  0.2500  0.4975

It’s interesting to see the range of the borrower rate. I suspect the rate differs by term and rating, but overall we get a median of 18.4%. There is also a notable spike around 31%, which is a very attractive rate for investors.

## 
##          AK    AL    AR    AZ    CA    CO    CT    DC    DE    FL    GA 
##  5515   200  1679   855  1901 14717  2210  1627   382   300  6720  5008 
##    HI    IA    ID    IL    IN    KS    KY    LA    MA    MD    ME    MI 
##   409   186   599  5921  2078  1062   983   954  2242  2821   101  3593 
##    MN    MO    MS    MT    NC    ND    NE    NH    NJ    NM    NV    NY 
##  2318  2615   787   330  3084    52   674   551  3097   472  1090  6729 
##    OH    OK    OR    PA    RI    SC    SD    TN    TX    UT    VA    VT 
##  4197   971  1817  2972   435  1122   189  1737  6842   877  3278   207 
##    WA    WI    WV    WY 
##  3048  1842   391   150

The highest number of loans went to borrowers from CA, with a count of 14,717, and the lowest to borrowers from ND, with a count of 52.
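The extremes can be read off programmatically instead of by eye. This sketch copies a handful of the state counts from the table above (not the full list) into a named vector; the variable names are illustrative.

```r
# A few state counts copied from the table above (not the full list).
state_counts <- c(CA = 14717, TX = 6842, NY = 6729, WY = 150, ME = 101, ND = 52)

# Names of the states with the largest and smallest counts.
most  <- names(which.max(state_counts))
least <- names(which.min(state_counts))

c(most = most, least = least)
```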

Now let’s have a look from the investors’ perspective.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -0.0100  0.1242  0.1730  0.1827  0.2400  0.4925

Interestingly, there is again a peak around 0.31, so there seems to be a direct relationship between the borrower rate (from the previous plot) and the lender yield.

## 
##    12    36    60 
##  1614 87778 24545

We notice that most loans are taken for a term of 36 months.

Univariate Analysis

What is the structure of your dataset?

The dataset comprises 81 (original) variables and 113,937 observations. Variables are of classes int, numeric, date, and factor. The dataset includes loans originated from 2005 to 2014. Although 81 variables seem like too many at first, a second look shows that they can be grouped around 2 main players: the “Borrowers” variables and the “Investors” variables.

What is/are the main feature(s) of interest in your dataset?

During the analysis, it became clear that there are 2 main players in the data: the “Borrowers” and the “Investors”. For borrowers, I believe the Prosper Score and Prosper Rating are the main indicators of borrower quality. For investors, I now understand that Lender Yield is the most important factor. I would like to explore these further in my bivariate analysis.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

Analysing the time and month of the loan listing, and the borrowers’ economic and financial status, can help support further investigation. The borrower’s state could also provide some interesting insights into specific locations where loans are more prevalent.

Did you create any new variables from existing variables in the dataset?

No. So far, I haven’t created any new variables, although I have converted a few variables to factors and changed the data formats of a few others.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

Yes, there are a few unusual distributions. There is a high spike in lender yield and borrower rate, and there are spikes in LoanOriginalAmount at round amounts. We also notice that most of the loans were taken for a period of 36 months, although I am unable to conclude why this is happening. To tidy the data, I converted some of the variables to factors, simply to visualize them better as categories.

Bivariate Plots Section

Ignoring the unclassified loans, we notice that there is a direct, roughly linear relationship between the Prosper rating and the borrower rate. As the rating drops from one level to the next, the median borrower rate increases.

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##        0    38400    56000    67300    81900 21000000

The above plot is a little unclear because of the many outliers. I would like to rebuild this plot on a subset of the data.

The above plot helps us understand that yearly income is related to Prosper Rating.
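The subsetting step mentioned above might be sketched as follows. The summary values are consistent with yearly income being `StatedMonthlyIncome` multiplied by 12 (the maximum of 1,750,000 becomes 21,000,000). The sample combines the first five monthly incomes printed by `str()` with the reported maximum, and the 95th-percentile cutoff is an assumption for illustration, not necessarily the cutoff used for the plot.

```r
# First five StatedMonthlyIncome values as printed by str(),
# plus the reported maximum to act as the outlier.
monthly <- c(3083, 6125, 2083, 2875, 9583, 1750000)
yearly <- monthly * 12

# Keep observations at or below the 95th percentile (an assumed cutoff).
cutoff <- quantile(yearly, 0.95)
trimmed <- yearly[yearly <= cutoff]

summary(trimmed)
```

Trimming the top of the distribution keeps the extreme incomes from compressing the rest of the plot.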

Let’s look at the correlation between DebtToIncomeRatio and BorrowerRate:

## [1] 0.06291678

With a correlation coefficient of about 0.06, the linear relationship between the two variables is very weak.
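A coefficient like this can be computed with `cor()` and checked with `cor.test()`. The sketch below uses the first five `DebtToIncomeRatio` and `BorrowerRate` values printed by `str()`, plus one hypothetical row with a missing ratio to show how missing values are handled. Its result will not match the value reported for the full dataset.

```r
# First five values from str(), plus a hypothetical sixth row with a missing ratio.
dti  <- c(0.17, 0.18, 0.06, 0.15, 0.26, NA)
rate <- c(0.158, 0.092, 0.275, 0.0974, 0.2085, 0.12)

# Pearson correlation, using only rows where both values are present.
r <- cor(dti, rate, use = "complete.obs")

# cor.test() drops incomplete pairs itself and also reports a p-value.
ct <- cor.test(dti, rate)

r
ct$p.value
```

DebtToIncomeRatio has 8,554 missing values in the full data, so the `use = "complete.obs"` argument matters there.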

Strangely, the retired category seems to have a higher loan original amount than the part-time category. In fact, part-time employees have the lowest loan original amount. I am unable to explain why this happens. My guess is that retired people may need loans for medical expenses or other financial needs.

Not such an interesting plot for now.

The plot is as expected, nothing interesting to notice here.

This chart offers an interesting new insight: although the majority of loans have a 36-month term, the loan original amount is noticeably higher for the 60-month term. Let’s see if this holds true for Lender Yield.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

I wanted to explore borrower and investor patterns and the variables related to them. So far, the only relationship I found is through the proprietary Prosper scoring system. The other factors I compared did not show any particular relationship.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

Most plots were as expected, so there was nothing very interesting. One exception was that the retired category seems to have a higher loan original amount than the part-time category.

What was the strongest relationship you found?

The strongest relationships were between Prosper Rating and Lender Yield and between Prosper Rating and Borrower Rate. Both are inverse relationships: the higher the rating, the lower the borrower rate and the lender yield.

Multivariate Plots Section

Here we take a closer look at Lender Yield vs Prosper Rating and at how Prosper Rating is influenced by Debt to Income Ratio.

This is a closer look at lender yield vs Prosper rating. The majority of loans have a 36-month term, and the returns for 36-month and 60-month loans are slightly higher than for 12-month loans, keeping in mind that there are far fewer 12-month loans than loans of the other terms.

Prosper appears to have optimized its model over the years: looking at borrowers year by year, the variation in borrower rate becomes less pronounced, and the standard deviation tends to shrink year over year. Something worth noticing is that the amount of borrowing suddenly decreased in 2013.
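The year-over-year spread can be checked numerically by computing the standard deviation of the borrower rate per origination year. This is only a sketch on made-up numbers: the data frame `rates` and all its values are hypothetical and chosen so that the later year has the smaller spread.

```r
# Hypothetical borrower rates for two origination years.
rates <- data.frame(
  year = c(2007, 2007, 2007, 2013, 2013, 2013),
  BorrowerRate = c(0.08, 0.20, 0.32, 0.17, 0.19, 0.21)
)

# Standard deviation of the borrower rate within each year.
rate_sd <- tapply(rates$BorrowerRate, rates$year, sd)
rate_sd
```

On the real data, a steadily shrinking value from year to year would support the observation above.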

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

The loan term is quite a good indicator of whether we get a better Lender Yield or not. We also see how the three variables Lender Yield, Prosper Rating, and Debt To Income Ratio come together and how they affect each other.

Were there any interesting or surprising interactions between features?

There seems to be a fixed borrower rate for the HR and AA ratings. This suggests that the eligibility criteria for AA and HR must be strict.

OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.

I did not create any models from the dataset.

Final Plots and Summary

Plot One:

Description One

This is one of the most intriguing plots for me, as it shows the relationship between Prosper rating and lender yield. The higher the risk, the lower the rating and the better the lender yield. We also notice that a high rating like AA does not come with a DebtToIncomeRatio above 25%. Although most borrowers have a low DebtToIncomeRatio, there are still borrowers with a high ratio, and they fall into the lower Prosper ratings. Therefore, the shape is an upward triangle.

Plot Two

## 
##  2005  2006  2007  2008  2009  2010  2011  2012  2013  2014 
##    22  5906 11460 11552  2047  5652 11228 19553 34345 12172

Description Two

This graph shows the drop in lending in 2009, most likely due to the recession that followed the 2008 global financial crisis. Once companies started to stabilize, lending recovered over the subsequent years, as can be seen from the graph.

Plot Three

## Debt Consolidation   Home Improvement           Business 
##              58308               7433               7189 
##      Personal Loan        Student Use               Auto 
##               2395                756               2572 
##    Baby & Adoption               Boat Cosmetic Procedure 
##                199                 85                 91 
##    Engagement Ring        Green Loans Household Expenses 
##                217                 59               1996 
##    Large Purchases     Medical/Dental         Motorcycle 
##                876               1522                304 
##                 RV              Taxes           Vacation 
##                 52                885                768 
##      Wedding Loans              Other      Not Available 
##                771              10494              16965

Description Three

Debt consolidation accounts for just over 50% of loans (58,308 of 113,937), while about 15% of listings (16,965) have no purpose available. My guess is that many of these borrowers are young customers who start their working life by taking home loans, car loans, etc., and that an older or settled generation would be less represented in this category.
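The shares can be verified directly from the counts in the table above. This sketch copies the three largest categories and the total number of observations. The variable names are illustrative.

```r
# Counts of the three largest categories, copied from the table above.
purpose <- c(`Debt Consolidation` = 58308, `Not Available` = 16965, Other = 10494)

# Total number of loans in the dataset.
total <- 113937

# Percentage share of each category, rounded to one decimal place.
share <- round(100 * purpose / total, 1)
share
```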

Reflection

This dataset seemed quite long and not very interesting. I was unable to draw many conclusions, as there was limited correlation between variables. My limited knowledge of the domain also seems to affect the insights in my report. I tried to cover the major features in the dataset which I felt required attention. Other features seemed redundant or not so interesting to me right now.

I focused my attention on the dataset from two perspectives: borrowers and investors. I found some relationships between the rates and yields which were very new to me. I was also able to draw some insights on the purposes for which people take loans.

As part of future work, I would like to perform feature selection and extraction in order to get better insights. I would also like to build a logistic regression model on this data to predict some of the target features.